import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import kaggle
from zipfile import ZipFile
Read in the data from Kaggle, view random samples of the data, and get the shapes of the training and testing sets:
NOTE: I made the burdensome mistake here of keeping the train and test sets separate and doing the imputation and feature engineering for each dataset. Since I applied the same methods to both sets, the result is equivalent to concatenating them first and processing them together - I just created extra work for myself, thinking that keeping the sets separate would help me observe any potential differences between the two. In hindsight, I would have concatenated the datasets first and worked off just the combined set.
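In that spirit, the concatenate-first workflow the note describes might look like this - a minimal sketch on toy frames (the values are made up, and the `_toy` names avoid clobbering the real train/test variables below):

```python
import pandas as pd

# Toy stand-ins for the Kaggle frames (train has the target, test does not)
train_toy = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test_toy = pd.DataFrame({"LotArea": [11622, 14267]})

# Concatenate once, remembering where the split point is
n_train = len(train_toy)
all_toy = pd.concat([train_toy.drop(columns="SalePrice"), test_toy], ignore_index=True)

# ... do all imputation / feature engineering on all_toy in one pass ...
all_toy["LotAreaK"] = all_toy["LotArea"] / 1000

# Split back out and reattach the target to the training rows
train_out = all_toy.iloc[:n_train].assign(SalePrice=train_toy["SalePrice"].values)
test_out = all_toy.iloc[n_train:].reset_index(drop=True)
```

Any transformation applied to `all_toy` is then guaranteed to hit both halves identically.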
!kaggle competitions download -c house-prices-advanced-regression-techniques
zipped = ZipFile('house-prices-advanced-regression-techniques.zip')
zipped.extractall()
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
pd.set_option('display.max_columns',100)
train.sample(5)
test.sample(5)
train.shape
test.shape
The training set has 81 columns, one of which is the target variable (SalePrice). The testing set has 80 columns, one for each feature. Lots of features, so let's summarize:
def create_summary(df):
    summary = pd.DataFrame()
    summary['dtype'] = df.dtypes
    summary['count'] = df.apply(lambda x: x.count())
    summary['missing'] = df.apply(lambda x: x.isnull().sum())
    summary['unique'] = df.apply(lambda x: [x.unique()])
    return summary
def summarize_missing(df):
    # Reuse create_summary rather than rebuilding the same frame
    summary = create_summary(df)
    return summary[summary.missing > 0].sort_values(by='missing', ascending=False)
summary_train = create_summary(train)
summary_test = create_summary(test)
summary_train.head()
summary_test.head()
Drop the ID field since it doesn't help us as a feature:
train = train.drop(columns=['Id'])
test = test.drop(columns=['Id'])
Look at features with missing values first, think about how to impute values if applicable:
summarize_missing(train)
summarize_missing(test)
PoolQC, MiscFeature, Alley, Fence, and FireplaceQu likely have so many missing values because these features are simply not present in most houses - it would make sense that most houses in Ames, IA do not have pools. Looking at the data description file, the NaN values for these features do correspond to "None", so impute with "None" for now:
def fillna_custom(df, features, fill_value):
    df_new = df.copy()
    for feature in features:
        df_new[feature] = df_new[feature].fillna(fill_value)
    return df_new
train_old = train
test_old = test
missing_features = ['PoolQC','MiscFeature','Alley','Fence','FireplaceQu']
train = fillna_custom(train_old, missing_features, "None")
test = fillna_custom(test_old, missing_features, "None")
All of the Garage features in the training set have 81 missing values. Check that all of these correspond to the same rows. If so, we can say that all of these correspond to houses with no garages, consistent with the data description document. We can impute GarageType, GarageFinish, GarageQual and GarageCond with "None":
def mismatch_nas(df, features):
    df_na = df[features].copy()  # copy to avoid chained-assignment warnings
    df_na['numNA'] = df_na.apply(lambda x: x.isna().sum(), axis=1)
    return df_na[(df_na['numNA'] < len(features)) & (df_na['numNA'] > 0)]
pd.options.mode.chained_assignment = None
garage_missing_features = ['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond']
mismatch_nas(train, garage_missing_features)
garage_missing_cat_features = ['GarageType','GarageFinish','GarageQual','GarageCond']
train = fillna_custom(train, garage_missing_cat_features, "None")
What about GarageYrBlt? It would make sense to impute this with YearBuilt, since garages are typically built along with the house:
train = fillna_custom(train, ['GarageYrBlt'], train['YearBuilt'])
train[garage_missing_features][train['GarageType']=="None"]
summarize_missing(train)
Garage features look good for the training set now. However, it looks like the testing set has some entries where this is not the case (some garage features are NaN, while others are not):
mismatch_nas(test, garage_missing_features)
Look at the GarageCars and GarageArea to see if these are actually garages:
test.loc[[666,1116], ['GarageCars','GarageArea']]
Index 666 looks like a real garage, but 1116 is suspect. It is the only entry in the testing set with missing GarageCars and GarageArea. I will treat 1116 as a house falsely marked with a detached garage: impute its categorical garage features with "None", GarageCars and GarageArea with 0, and GarageYrBlt with YearBuilt:
def single_row_impute(df, index, features, value):
    df_new = df.copy()
    for feature in features:
        df_new.loc[index, feature] = value
    return df_new
test = single_row_impute(test, 1116, garage_missing_cat_features, "None")
test = single_row_impute(test, 1116, ['GarageCars','GarageArea'], 0)
test = single_row_impute(test, 1116, ['GarageYrBlt'], test.loc[1116, 'YearBuilt'])
test.loc[1116,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars','GarageArea']]
For index 666, let's impute GarageFinish, GarageQual, and GarageCond with the most frequently occurring value for detached garages. Impute GarageYrBlt with YearBuilt, same as before:
test = single_row_impute(test, 666, ['GarageYrBlt'], test.loc[666, 'YearBuilt'])
test[test['GarageType']=="Detchd"]['GarageFinish'].value_counts()
test[test['GarageType']=="Detchd"]['GarageQual'].value_counts()
test[test['GarageType']=="Detchd"]['GarageCond'].value_counts()
test = single_row_impute(test, 666, ['GarageFinish'], 'Unf')
test = single_row_impute(test, 666, ['GarageQual'], 'TA')
test = single_row_impute(test, 666, ['GarageCond'], 'TA')
test.loc[666,['GarageType','GarageYrBlt','GarageFinish','GarageQual','GarageCond','GarageCars','GarageArea']]
Looks good, let's impute the rest of the garage features in the test set now:
test = fillna_custom(test, garage_missing_cat_features, "None")
test = fillna_custom(test, ['GarageYrBlt'], test['YearBuilt'])
test[garage_missing_features][test['GarageType']=="None"]
summarize_missing(test)
The garage features are now all taken care of, moving on to the next set...
Looking at the basement features in the training and testing sets, there appear to be a few mismatches where some values are NA but others are not:
basement_features = ['BsmtCond','BsmtExposure','BsmtQual','BsmtFinType1','BsmtFinType2']
mismatch_nas(train, basement_features)
Looking at square footages to get a better picture here:
basement_features_all = basement_features+['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF']
train.loc[[332,948], basement_features_all]
I'll impute BsmtFinType2 for index 332 with GLQ to match BsmtFinType1:
train.loc[332,'BsmtFinType2'] = 'GLQ'
For BsmtExposure for index 948, see what the most frequently occurring value is for unfinished basements:
train[train.BsmtFinType1=="Unf"]['BsmtExposure'].value_counts()
train.loc[948,'BsmtExposure'] = 'No'
mismatch_nas(train, basement_features)
No more mismatches across NAs, so we can impute the values column by column. As done with the garage features, impute any NA to "None", since these are defined by houses having no basement:
train = fillna_custom(train, basement_features, "None")
summarize_missing(train)
Training set looks good now, let's look at the testing set:
mismatch_nas(test, basement_features)
index_missing = list(mismatch_nas(test, basement_features).index)
test.loc[index_missing, basement_features_all]
There are missing values for BsmtCond, BsmtExposure, and BsmtQual. Impute BsmtExposure in the same way we did for the training set:
test.loc[[27,888],'BsmtExposure'] = 'No'
Can we use BsmtCond to impute BsmtQual and vice versa?
pd.crosstab(test.BsmtCond, test.BsmtQual)
test.BsmtCond.value_counts()
test.BsmtQual.value_counts()
For BsmtCond, the safest bet is to impute "TA", which makes up the vast majority of samples:
test.loc[[580,725,1064],'BsmtCond'] = "TA"
Looking at the crosstab matrix, BsmtQual seems to correlate with BsmtCond, especially for BsmtCond = "Gd". But we don't have that type of missing value here (only "TA" and "Fa"), so it seems best to impute with "TA" here too:
test.loc[[757,758],'BsmtQual'] = "TA"
Impute the remaining values as we did before:
test = fillna_custom(test, basement_features, "None")
Other basement features were missing 1 or 2 values as well in the testing set, let's just take care of them here:
test[test.BsmtFinSF1.isna()][basement_features_all]
The missing basement square footages should all be zero, since there is no basement for this house:
test = single_row_impute(test, 660, ['BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF'], 0)
Looking at the missing basement bathrooms:
test[test.BsmtFullBath.isna()][basement_features_all+['BsmtFullBath','BsmtHalfBath']]
No basement in these cases either, so impute with zero:
test = single_row_impute(test, [660,728], ['BsmtFullBath','BsmtHalfBath'], 0)
summarize_missing(test)
masvnr_features = ['MasVnrType','MasVnrArea']
mismatch_nas(train, masvnr_features)
No mismatches in missing values, let's look at all of them:
train[train.MasVnrType.isna()]
train.MasVnrType.value_counts()
Is there a relationship between the Exterior features and MasVnrType?
pd.crosstab(train.Exterior1st, train.MasVnrType)
pd.crosstab(train.Exterior2nd, train.MasVnrType)
Doesn't look like it. Let's just impute with "None" and 0 for the area, since those are the most common values.
train = fillna_custom(train, ['MasVnrType'], "None")
train = fillna_custom(train, ['MasVnrArea'], 0)
summarize_missing(train)
Now looking at the testing set:
mismatch_nas(test, masvnr_features)
Impute the missing MasVnrType here with BrkFace, which is the most common type that isn't "None". Since the area is non-zero, it doesn't make sense to impute "None":
test = single_row_impute(test, 1150, ['MasVnrType'], "BrkFace")
mismatch_nas(test, masvnr_features)
Looking at all of the missing values now:
test[test.MasVnrType.isna()]
Use the same method as the training set to impute:
test = fillna_custom(test, ['MasVnrType'], "None")
test = fillna_custom(test, ['MasVnrArea'], 0)
summarize_missing(test)
LotFrontage is still missing a lot of values. Let's look at the LotConfig variable for these missing rows to see if imputing with zero makes sense (i.e. the house doesn't have a street in front of it):
all_data = pd.concat([train.drop(columns='SalePrice'), test]).reset_index(drop=True)
all_data[all_data.LotFrontage.isna()]['LotConfig'].value_counts()
pd.set_option('display.max_rows', 100)
all_data.LotConfig.value_counts()
Unfortunately LotConfig doesn't seem to help much. Intuitively, can we use the square root of LotArea?
plt.scatter(np.sqrt(all_data.LotArea), all_data.LotFrontage)
plt.plot(np.linspace(0,500), np.linspace(0,500))
There is a lot of fanning, which makes sense (lots can be wider with more frontage, or longer with less frontage). Not every lot will be perfectly square. Let's group by other features to see if imputing with the average is appropriate:
all_data.groupby('LotConfig')['LotFrontage'].mean()
all_data[['LotConfig','LotFrontage']].boxplot(by='LotConfig', figsize=(8,8))
all_data.groupby('MSZoning')['LotFrontage'].mean()
all_data[['MSZoning','LotFrontage']].boxplot(by='MSZoning', figsize=(8,8))
all_data.groupby('Neighborhood')['LotFrontage'].mean()
all_data[['Neighborhood','LotFrontage']].boxplot(by='Neighborhood', figsize=(16,8))
all_data[all_data.LotFrontage.isna()]['Neighborhood'].value_counts()
all_data.Neighborhood.value_counts()
There looks to be the most variability in LotFrontage by Neighborhood, which makes sense - some neighborhoods will be more dense than others. I think this is an improvement over imputing by the mean or median grouped by LotConfig or MSZoning. Let's go ahead and impute LotFrontage by the median of the neighborhood's LotFrontage (not mean, since this is more sensitive to outliers):
frontage_median = all_data.groupby('Neighborhood')[['LotFrontage']].median()
train['LotFrontage'] = train.apply(lambda x:frontage_median.loc[x.Neighborhood].values[0] if np.isnan(x.LotFrontage) else x.LotFrontage, axis=1)
test['LotFrontage'] = test.apply(lambda x:frontage_median.loc[x.Neighborhood].values[0] if np.isnan(x.LotFrontage) else x.LotFrontage, axis=1)
summarize_missing(train)
summarize_missing(test)
The Electrical feature in the training set is only missing one value:
train[train.Electrical.isna()]
This looks to be missing completely at random. Impute with the most commonly occurring value:
train.Electrical.value_counts()
train = fillna_custom(train, ['Electrical'], "SBrkr")
summarize_missing(train)
Training set is all good now, no missing values. The testing set is missing values from a few more features:
summarize_missing(test)
test_missing_features = list(summarize_missing(test).index)
test[test_missing_features][test.isna().any(axis=1)]
The missing values look like they are scattered and not consistent across all features. Let's look at each one by one:
Utilities is easy, since the only value in the entire dataset is AllPub. Impute with AllPub:
test.Utilities.value_counts()
test = fillna_custom(test, ['Utilities'], "AllPub")
There does appear to be a kitchen in the house with the missing KitchenQual, so we should impute with an existing value and not "None". Since "TA" is the most common value, let's just use that:
test.KitchenQual.value_counts()
test.loc[95, ['KitchenAbvGr','KitchenQual']]
test = fillna_custom(test, ['KitchenQual'], "TA")
Same treatment for SaleType and Functional:
test.SaleType.value_counts()
test = fillna_custom(test, ['SaleType'], "WD")
test.Functional.value_counts()
test = fillna_custom(test, ['Functional'], "Typ")
For the Exterior features, first see if we can get anything from the MasVnrType. If not, just impute with the most frequently occurring value:
test.loc[691,'MasVnrType']
test.Exterior1st.value_counts()
test = fillna_custom(test, ['Exterior1st'], "VinylSd")
test = fillna_custom(test, ['Exterior2nd'], "VinylSd")
For MSZoning the vast majority of entries are "RL", so we can go with that:
test.MSZoning.value_counts()
test = fillna_custom(test, ['MSZoning'], "RL")
summarize_missing(test)
No more missing values in the training and testing sets, let's look at the summaries:
create_summary(train)
create_summary(test)
SalePrice is the target variable, so it would make most sense to see how each numerical feature varies with SalePrice. This is also a good time to consider encoding ordinal categorical variables. I ordered categories taking into account how each value would affect desirability of buying property (e.g. a house on a steep slope would be less desirable than a house on level ground).
train.columns
I've listed all ordinal categorical variables below:
Create a dictionary with the encodings:
# Create separate variable for quality variables since there are so many
qual_encodings = {"None":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}
qual_features = ['ExterQual','ExterCond','BsmtQual','BsmtCond','HeatingQC','KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC']
dict_encode = {}
for feature in qual_features:
    dict_encode[feature] = qual_encodings
dict_encode['Street'] = {"Grvl":0, "Pave":1}
dict_encode['Alley'] = {"None":0, "Grvl":1, "Pave":2}
dict_encode['LotShape'] = {"IR3":0, "IR2":1, "IR1":2, "Reg":3}
dict_encode['LandContour'] = {"Low":0, "HLS":1, "Bnk":2, "Lvl":3}
dict_encode['Utilities'] = {"ELO":0, "NoSeWa":1, "NoSewr":2, "AllPub":3}
dict_encode['LandSlope'] = {"Sev":0, "Mod":1, "Gtl":2}
dict_encode['BsmtExposure'] = {"None":0, "No":1, "Mn":2, "Av":3, "Gd":4}
dict_encode['BsmtFinType1'] = {"None":0, "Unf":1, "LwQ":2, "Rec":3, "BLQ":4, "ALQ":5, "GLQ":6}
dict_encode['BsmtFinType2'] = {"None":0, "Unf":1, "LwQ":2, "Rec":3, "BLQ":4, "ALQ":5, "GLQ":6}
dict_encode['CentralAir'] = {"N":0, "Y":1}
dict_encode['Functional'] = {"Sal":0, "Sev":1, "Maj2":2, "Maj1":3, "Mod":4, "Min2":5, "Min1":6, "Typ":7}
dict_encode['GarageFinish'] = {"None":0, "Unf":1, "RFn":2, "Fin":3}
dict_encode['PavedDrive'] = {"N":0, "P":1, "Y":2}
dict_encode['Fence'] = {"None":0, "MnWw":1, "GdWo":2, "MnPrv":3, "GdPrv":4}
def ordinal_encode(df, features, dict_encode):
    for feature in features:
        df[feature] = df[feature].map(dict_encode[feature])
    return df
ord_features = list(dict_encode.keys())
train = ordinal_encode(train, ord_features, dict_encode)
test = ordinal_encode(test, ord_features, dict_encode)
Check that the encoding worked properly:
create_summary(train)
create_summary(test)
Looks good. Now let's look at correlations to SalePrice and some numerical and graphical EDA. The describe method is a good sanity check for all of the features:
train.describe().transpose()
test.describe().transpose()
Numerical correlations with SalePrice in descending order of correlation strength:
train_corr = train.drop(columns='SalePrice').corrwith(train.SalePrice).sort_values(ascending=False)
train_corr
len(train_corr)
def plots_vs_target(df, target, cols, ncol_plot=3):
    nrow_plot = int(np.ceil(len(cols)/ncol_plot))
    fig, ax = plt.subplots(nrow_plot, ncol_plot, figsize=(16, 16/ncol_plot*nrow_plot))
    r = 0
    c = 0
    for col in cols:
        ax[r, c].scatter(df[col], df[target])
        ax[r, c].set_title(col)
        c += 1
        if c == ncol_plot:
            c = 0
            r += 1
plots_vs_target(train, 'SalePrice', list(train_corr.index))
A quick observation here is that MSSubClass is of type int64, but the number acts as a label rather than an actual value. Convert it from int to str:
train['MSSubClass'] = train['MSSubClass'].apply(str)
test['MSSubClass'] = test['MSSubClass'].apply(str)
Some observations from the scatterplots:
Consider the MoSold feature - is a scale from 1 to 12 the wrong way to think about it? Dig a bit into seasonality:
train.boxplot(column='SalePrice', by='MoSold', figsize=(16,8))
It is common belief that housing prices rise in the summer and fall in the winter. What if we map the month to a sine function (trough in the winter, peak in the summer)? This is a common practice done in environmental engineering:
month = np.linspace(1,12,12)
sin_month = np.sin((month-1)*np.pi/11)
plt.plot(month,sin_month)
train_sin_month = train['MoSold'].apply(lambda x:np.sin((x-1)*np.pi/11))
plt.scatter(train_sin_month, train.SalePrice)
train_sin_month.corr(train.SalePrice)
This doesn't help much, in fact the correlation went down slightly. It may be best to convert MoSold to a label, treating it as a categorical variable:
train['MoSold'] = train['MoSold'].apply(str)
test['MoSold'] = test['MoSold'].apply(str)
Before moving on to categorical features, let's see if we can remove some outliers. This is most noticeable in the SalePrice vs. GrLivArea plot:
plt.scatter(train.GrLivArea, train.SalePrice)
train[train.GrLivArea > 4000]
The creator of the Ames housing dataset actually recommends removing all houses with GrLivArea > 4000. I think this makes sense for the two largest houses in the dataset - the SalePrice clearly doesn't correlate with GrLivArea, and the reason for that is these two sales were partial sales (the house was not complete when sold). Let's remove these and look at the scatterplots again (all of them):
train = train[train.GrLivArea < 4000].reset_index(drop=True)
plots_vs_target(train, 'SalePrice', list(train_corr.index))
train.tail()
Now onto categorical variables - use boxplots to visualize how SalePrice varies with categorical features:
cat_cols = list(train.select_dtypes(include=['object']).columns)
cat_cols
def boxplots_vs_target(df, target, cols):
    nrow_plot = len(cols)
    fig, ax = plt.subplots(nrow_plot, 1, figsize=(16, 6*nrow_plot))
    r = 0
    for col in cols:
        df.boxplot(column=target, by=col, ax=ax[r])  # use the target parameter, not a hardcoded column
        fig.suptitle('')
        r += 1
boxplots_vs_target(train, 'SalePrice', cat_cols)
for col in cat_cols:
    print(train[col].value_counts())
    print('')
Some observations here:
Converting Electrical to ordinal categorical and looking at the correlation with SalePrice:
elec_encode = {}
elec_encode['Electrical'] = {'Mix':0, 'FuseP':1, 'FuseF':2, 'FuseA':3, 'SBrkr':4}
train = ordinal_encode(train, ['Electrical'], elec_encode)
test = ordinal_encode(test, ['Electrical'], elec_encode)
train.SalePrice.corr(train.Electrical)
ord_features.append('Electrical')
Last step before moving on - look at value counts for ordinal categorical variables:
for col in ord_features:
    print(train[col].value_counts())
    print('')
Based on this, we should also consider removing Street, Utilities, and PoolQC.
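One way to make that call systematically is to flag any column whose single most frequent value covers almost every row. A minimal sketch (the `near_constant_features` helper, the 0.99 cutoff, and the toy frame are my own illustration, not part of the original analysis):

```python
import pandas as pd

def near_constant_features(df, threshold=0.99):
    """Return columns whose most frequent value covers >= `threshold` of rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy frame: 'Street' is 99% 'Pave', 'LotShape' is evenly mixed
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "LotShape": ["Reg", "IR1"] * 50,
})
near_constant_features(toy)  # ['Street']
```

Running this over the real training set would surface Street, Utilities, and PoolQC for the same reason the value counts above do.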
Candidates for feature engineering include:
I'll aim to develop better predictors in feature engineering, and also to develop features that have more explainability than the original ones (in other words, easier to explain the model to an audience). In evaluating engineered vs. original features, I will do a comparison of how each correlates with SalePrice.
Before diving into feature engineering, let's look at a correlation matrix between all variables to see if there is any multicollinearity that we should consider:
cat_cols = list(train.select_dtypes(include=['object']).columns)
num_cols = [x for x in train.columns if x not in cat_cols]
def corr_matrix(df, num_cols):
    df_corrs = df[num_cols].corr().round(2)
    mask = np.zeros_like(df_corrs, dtype=bool)  # np.bool was removed in NumPy 1.24+
    mask[np.triu_indices_from(mask)] = True
    f, ax = plt.subplots(figsize=(30, 30))
    cmap = sns.diverging_palette(250, 250, s=100, l=20, as_cmap=True)
    sns.heatmap(df_corrs, annot=df_corrs, mask=mask, cmap=cmap, center=0, linewidths=.5, cbar=False)
corr_matrix(train, num_cols)
Taking the correlations and some physical intuition into account, here are some candidates for combined features:
- YearBuilt and GarageYrBlt (together with YearRemodAdd and YrSold)
- TotRmsAbvGrd and GrLivArea
- Fireplaces and FireplaceQu
- GarageQual and GarageCond
- GarageCars and GarageArea

Let's look at these pairs one by one:
plt.scatter(train.YearBuilt, train.GarageYrBlt)
First observation here is that the majority of houses lie on the 1:1 line, meaning that the garage and house were built in the same year. There are also many houses where the garage was built later, and just a few where the garage was built first.
Let's try using ages of when the house was sold rather than years and see how correlations with SalePrice and the cross-correlation between HouseAge and GarageAge look:
HouseAge = train.YrSold-train.YearBuilt
GarageAge = train.YrSold-train.GarageYrBlt
RemodelAge = train.YrSold-train.YearRemodAdd
for feature in [HouseAge, GarageAge, RemodelAge]:
    print(train.SalePrice.corr(feature))
Correlations are negative because lower age (newer houses) correlate with higher sale prices. How do these correlate with each other?
print("HouseAge with GarageAge: "+str(HouseAge.corr(GarageAge)))
print("HouseAge with RemodelAge: "+str(HouseAge.corr(RemodelAge)))
print("GarageAge with RemodelAge: "+str(GarageAge.corr(RemodelAge)))
HouseAge with GarageAge is still an issue with high multicollinearity. Try using the difference in age between the garage and house instead of GarageAge:
DiffAge = GarageAge-HouseAge
print("HouseAge with DiffAge: "+str(HouseAge.corr(DiffAge)))
print("RemodelAge with DiffAge: "+str(RemodelAge.corr(DiffAge)))
print("SalePrice with DiffAge: "+str(train.SalePrice.corr(DiffAge)))
DiffAge isn't too well correlated with SalePrice, but multicollinearity is no longer an issue here.
Based on what we found, let's remove the following features: YearBuilt, GarageYrBlt, YearRemodAdd, YrSold.
And replace them with HouseAge (YrSold - YearBuilt), DiffGarageAge (GarageYrBlt - YearBuilt), and RemodelAge (YearRemodAdd - YearBuilt):
def age_features(df):
    df_new = df.copy()
    df_new['HouseAge'] = df.YrSold-df.YearBuilt
    df_new['DiffGarageAge'] = df.GarageYrBlt-df.YearBuilt
    df_new['RemodelAge'] = df.YearRemodAdd-df.YearBuilt
    df_new = df_new.drop(columns=['YearBuilt','GarageYrBlt','YearRemodAdd','YrSold'])
    return df_new
train_old = train.copy()
test_old = test.copy()
train = age_features(train_old)
test = age_features(test_old)
train[['HouseAge','DiffGarageAge','RemodelAge']].describe()
test[['HouseAge','DiffGarageAge','RemodelAge']].describe()
plt.scatter(train.TotRmsAbvGrd, train.GrLivArea)
Can we create a better feature from these two? Both correlate positively with SalePrice:
train.SalePrice.corr(train.TotRmsAbvGrd)
train.SalePrice.corr(train.GrLivArea)
plt.scatter(train.TotRmsAbvGrd, train.SalePrice)
plt.scatter(train.GrLivArea, train.SalePrice)
Let's try average room size (GrLivArea/TotRmsAbvGrd) and GrLivArea multiplied by TotRmsAbvGrd:
train.SalePrice.corr(train.GrLivArea/train.TotRmsAbvGrd)
train.SalePrice.corr(train.GrLivArea*train.TotRmsAbvGrd)
Neither are as well correlated with SalePrice as GrLivArea alone. I can't think of a physically intuitive way to combine these two features, and both seem to be relatively good predictors of SalePrice. Let's just leave them in for now, and see if they drop out during feature selection when we start to fit models.
plt.scatter(train.Fireplaces, train.FireplaceQu)
train[['Fireplaces','FireplaceQu','SalePrice']].groupby(['Fireplaces','FireplaceQu']).count().reset_index()
train.SalePrice.corr(train.Fireplaces)
train.SalePrice.corr(train.FireplaceQu)
First observation is that there are few houses with three fireplaces, and as shown in the scatterplots before, the presence of a third fireplace doesn't correlate with a higher SalePrice. Let's treat 3 fireplaces and 2 fireplaces as the same (so essentially changing Fireplaces from number of fireplaces to an indicator for an extra fireplace: 0 = no extra fireplace, 1 = at least one extra fireplace), and add on the FireplaceQu score:
FireplaceScr = train.Fireplaces.apply(lambda x:x if x==0 else (x-2 if x>2 else x-1))+train.FireplaceQu
train.SalePrice.corr(FireplaceScr)
This is a marginally better feature, but more importantly it combines the two fireplace features into one, eliminating the possibility of multicollinearity. The FireplaceScore equals the FireplaceQu plus one additional point if there is an extra fireplace.
train['FireplaceScr'] = train.Fireplaces.apply(lambda x:x if x==0 else (x-2 if x>2 else x-1))+train.FireplaceQu
test['FireplaceScr'] = test.Fireplaces.apply(lambda x:x if x==0 else (x-2 if x>2 else x-1))+test.FireplaceQu
train = train.drop(columns=['Fireplaces','FireplaceQu'])
test = test.drop(columns=['Fireplaces','FireplaceQu'])
plt.scatter(train.GarageQual, train.GarageCond)
train.SalePrice.corr(train.GarageQual)
train.SalePrice.corr(train.GarageCond)
Can we just add the two to get a better predictor?
train.SalePrice.corr(train.GarageQual+train.GarageCond)
plt.scatter(train.GarageQual+train.GarageCond, train.SalePrice)
There isn't a huge loss in correlation, and combining the two gets rid of any multicollinearity. Let's go ahead with it:
def add_garage_qual_feature(df):
    df_new = df.copy()
    df_new['GarageQualCond'] = df.GarageQual+df.GarageCond
    df_new = df_new.drop(columns=['GarageQual','GarageCond'])
    return df_new
train_old = train.copy()
test_old = test.copy()
train = add_garage_qual_feature(train_old)
test = add_garage_qual_feature(test_old)
train.GarageQualCond.describe()
test.GarageQualCond.describe()
plt.scatter(train.GarageCars, train.GarageArea)
Like TotRmsAbvGrd and GrLivArea, it makes sense that these two are correlated.
train.SalePrice.corr(train.GarageCars)
train.SalePrice.corr(train.GarageArea)
These two features are especially well correlated - on their own, they would produce a VIF of about 4.4 (nearly at the common threshold of 5, where we would consider combining or dropping them).
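That VIF figure can be reproduced from the pairwise correlation alone: when one predictor is regressed on a single other predictor, VIF = 1/(1 - r²), so r ≈ 0.88 between GarageCars and GarageArea gives roughly 1/(1 - 0.77) ≈ 4.4. A small sketch with toy data (the `pairwise_vif` helper and the simulated cars/area series are illustrative, not from the original notebook):

```python
import numpy as np
import pandas as pd

def pairwise_vif(x, y):
    """VIF of one predictor regressed on a single other predictor: 1 / (1 - r^2)."""
    r = x.corr(y)
    return 1.0 / (1.0 - r ** 2)

# Toy data standing in for GarageCars/GarageArea: area tracks cars plus noise
rng = np.random.default_rng(0)
cars = pd.Series(rng.integers(0, 4, size=200).astype(float))
area = pd.Series(cars * 250 + rng.normal(0, 80, size=200))
vif = pairwise_vif(area, cars)  # well above 1, since the two move together
```

With more than two predictors one would regress each feature on all the others instead, but for a single pair the formula above is exact.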
Let's try average garage space size (GarageArea/GarageCars):
train.SalePrice.corr(train.GarageArea/train.GarageCars)
The ratio isn't an improvement, but GarageArea times GarageCars gives a slightly better correlation. It doesn't have much physical sense, but it is easy to explain and calculate. Let's go with it for now:
train.SalePrice.corr(train.GarageArea*train.GarageCars)
train['GarageAreaCars'] = train.GarageArea*train.GarageCars
test['GarageAreaCars'] = test.GarageArea*test.GarageCars
train = train.drop(columns=['GarageArea','GarageCars'])
test = test.drop(columns=['GarageArea','GarageCars'])
Since there are so few pools in the dataset, let's just remove PoolQC and PoolArea, and create an indicator variable for if a pool exists (0 = no pool, 1 = pool):
def add_pool_feature(df):
    df_new = df.copy()
    df_new['Pool'] = df.PoolArea.apply(lambda x: 0 if x == 0 else 1)
    df_new = df_new.drop(columns=['PoolQC','PoolArea'])
    return df_new
train_old = train.copy()
test_old = test.copy()
train = add_pool_feature(train)
test = add_pool_feature(test)
train.Pool.value_counts()
test.Pool.value_counts()
ord_features.append('Pool')
Before going onto other new features, check that most of our multicollinearity has been solved:
num_cols = [x for x in train.columns if x not in cat_cols]
corr_matrix(train, num_cols)
The only remaining traces of possible multicollinearity are for pairs we already analyzed or for pairs where the next steps in feature engineering may solve things. Let's proceed:
Houses are commonly listed with total number of bathrooms (e.g. 3.5) rather than breaking them out by full/half and location (e.g. 2 full baths, 0.5 half baths, 1 full bath in the basement). Let's see if correlation with SalePrice does better with total bathrooms:
TotalBath = train.FullBath+0.5*train.HalfBath+train.BsmtFullBath+0.5*train.BsmtHalfBath
TotalBath.corr(train.SalePrice)
bath_features = ['FullBath','HalfBath','BsmtFullBath','BsmtHalfBath']
train_corr[bath_features]
plt.scatter(TotalBath,train.SalePrice)
We should keep the two outliers in mind here (more bathrooms but lower SalePrice), but overall the combined feature looks like a better predictor than each of the individual features. It is also more explainable and relates better to how houses are actually listed. Let's keep it and drop the rest:
train['TotalBath'] = train.FullBath+0.5*train.HalfBath+train.BsmtFullBath+0.5*train.BsmtHalfBath
test['TotalBath'] = test.FullBath+0.5*test.HalfBath+test.BsmtFullBath+0.5*test.BsmtHalfBath
train = train.drop(columns=bath_features)
test = test.drop(columns=bath_features)
train.TotalBath.describe()
test.TotalBath.describe()
We can also bring in BsmtFinType1 and BsmtFinType2 as potential weighting factors here.
First let's see how well each correlates with SalePrice:
sf_features = ['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF', 'LowQualFinSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF']
for feature in sf_features:
    print(feature+" correlation with SalePrice: "+str(train.SalePrice.corr(train[feature])))
def corr_sf(df, sf_features):
    return df.SalePrice.corr(df[sf_features].sum(axis=1))  # use df, not the global train
Try total SF first (1stFlr + 2ndFlr + TotalBsmt):
corr_sf(train,['1stFlrSF','2ndFlrSF','TotalBsmtSF'])
That looks like a much better predictor. Let's try just the finished square footage (1stFlr + 2ndFlr + BsmtFin1 + BsmtFin2):
corr_sf(train,['1stFlrSF','2ndFlrSF','BsmtFinSF1','BsmtFinSF2'])
Not as good. What if we subtract off LowQualFinSF from the total:
train.SalePrice.corr(train[['1stFlrSF','2ndFlrSF','TotalBsmtSF']].sum(axis=1)-train['LowQualFinSF'])
This is likely the best predictor we can create out of the square footage variables, let's go for it:
train['HighQualSF'] = train[['1stFlrSF','2ndFlrSF','TotalBsmtSF']].sum(axis=1)-train['LowQualFinSF']
test['HighQualSF'] = test[['1stFlrSF','2ndFlrSF','TotalBsmtSF']].sum(axis=1)-test['LowQualFinSF']
train = train.drop(columns=['1stFlrSF','2ndFlrSF','TotalBsmtSF','LowQualFinSF'])
test = test.drop(columns=['1stFlrSF','2ndFlrSF','TotalBsmtSF','LowQualFinSF'])
We should take a closer look at the basement SF and finish features:
bsmt_features = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','BsmtFinType1','BsmtFinType2']
for feature in bsmt_features:
    print(feature+" correlation with SalePrice: "+str(train.SalePrice.corr(train[feature])))
If we just sum the bsmt square footages, we should get the same correlation compared to TotalBsmtSF:
corr_sf(train,['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'])
train.SalePrice.corr(train[['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF']].sum(axis=1)-train.BsmtUnfSF)
Can we get a better predictor than this? Let's try using the BsmtFinType features as "weights". We'll only count square footages for basements that can be used as living spaces similarly to above ground rooms (BsmtFinType = GLQ, ALQ or BLQ), then add BsmtUnfSF to the "livable" basement square footage. In other words, we're dropping low quality space, similar to what we did for total square footage:
BsmtSFWeight = train.BsmtFinSF1*train.BsmtFinType1.apply(lambda x:0 if x < 4 else 1) + train.BsmtFinSF2*train.BsmtFinType2.apply(lambda x:0 if x < 4 else 1) + train.BsmtUnfSF
train.SalePrice.corr(BsmtSFWeight)
This is a slightly better feature than TotalBsmtSF, and consistent with the method we used to quantify total square footage. It also has the advantage of combining 5 features into one, improving model interpretability.
train['HighQualBsmtSF'] = train.BsmtFinSF1*train.BsmtFinType1.apply(lambda x:0 if x < 4 else 1) + train.BsmtFinSF2*train.BsmtFinType2.apply(lambda x:0 if x < 4 else 1) + train.BsmtUnfSF
test['HighQualBsmtSF'] = test.BsmtFinSF1*test.BsmtFinType1.apply(lambda x:0 if x < 4 else 1) + test.BsmtFinSF2*test.BsmtFinType2.apply(lambda x:0 if x < 4 else 1) + test.BsmtUnfSF
train = train.drop(columns=bsmt_features)
test = test.drop(columns=bsmt_features)
train.HighQualBsmtSF.describe()
test.HighQualBsmtSF.describe()
One last thing, can we combine HighQualBsmtSF with BsmtQual to get a proxy for high quality basement volume?
train.BsmtQual.value_counts()
Let's map BsmtQual to approximate ceiling heights (2 = 75 inches, 3 = 85 inches, 4 = 95 inches, 5 = 105 inches), converted to feet:
dict_BsmtQual = {0:0, 2:75/12, 3:85/12, 4:95/12, 5:105/12}
BsmtQual_new = train.BsmtQual.map(lambda x:dict_BsmtQual[x])
BsmtQual_new.value_counts()
HighQualBsmtVol = train.HighQualBsmtSF*BsmtQual_new
train.SalePrice.corr(HighQualBsmtVol)
This is a better feature than both HighQualBsmtSF and BsmtQual individually. Let's go with it:
train['HighQualBsmtVol'] = train.BsmtQual.map(lambda x:dict_BsmtQual[x])*train.HighQualBsmtSF
test['HighQualBsmtVol'] = test.BsmtQual.map(lambda x:dict_BsmtQual[x])*test.HighQualBsmtSF
train = train.drop(columns=['BsmtQual','HighQualBsmtSF'])
test = test.drop(columns=['BsmtQual','HighQualBsmtSF'])
It would make most sense to calculate a total deck/porch square footage and see how that correlates with SalePrice vs. each individual feature:
porch_features = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']
for feature in porch_features:
    print(feature + " correlation with SalePrice: " + str(train.SalePrice.corr(train[feature])))
DeckPorchSF = train[porch_features].sum(axis=1)
train.SalePrice.corr(DeckPorchSF)
This does seem to be a better feature than the 5 individual features. Let's go ahead and replace them:
train['DeckPorchSF'] = train[porch_features].sum(axis=1)
test['DeckPorchSF'] = test[porch_features].sum(axis=1)
train = train.drop(columns=porch_features)
test = test.drop(columns=porch_features)
train.DeckPorchSF.describe()
test.DeckPorchSF.describe()
Looking at the value counts for these categorical variables again:
train.Utilities.value_counts()
test.Utilities.value_counts()
train.RoofMatl.value_counts()
test.RoofMatl.value_counts()
train.Heating.value_counts()
test.Heating.value_counts()
train.Street.value_counts()
test.Street.value_counts()
These fields are dominated by a single value, so they offer little information to the model. Dropping them:
train = train.drop(columns=['Utilities','RoofMatl','Heating','Street'])
test = test.drop(columns=['Utilities','RoofMatl','Heating','Street'])
Is MiscFeature really necessary when we already have MiscVal?
train.SalePrice.corr(train.MiscVal)
sns.boxplot(x=train.MiscFeature, y=train.SalePrice)
train.MiscFeature.value_counts()
test.MiscFeature.value_counts()
Most houses don't have a misc feature anyways, and the value of the features is already in MiscVal. Let's drop MiscFeature:
train = train.drop(columns=['MiscFeature'])
test = test.drop(columns=['MiscFeature'])
train.columns
First let's pull out and look at the target variable, SalePrice:
sale_price = train.SalePrice
train = train.drop(columns=['SalePrice'])
sns.distplot(sale_price)
sale_price.skew()
SalePrice looks positively skewed, and the skewness of 1.9 confirms it. A log transform should help:
log_sale_price = np.log1p(sale_price)
sns.distplot(log_sale_price)
log_sale_price.skew()
Looks a lot better. The log of SalePrice is our new target variable.
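Since the model is trained on log1p(SalePrice), predictions will need to be mapped back to dollars at submission time. A quick sanity check that np.expm1 is the exact inverse of np.log1p (np.exp alone would be off by 1):

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 755000.0])
logged = np.log1p(prices)        # log(1 + x), safe even if a value were 0
recovered = np.expm1(logged)     # exp(x) - 1, the exact inverse

print(np.allclose(prices, recovered))  # True
```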
train.dtypes
Now look at the skewness of each feature. Those with high skewness are candidates for a Box-Cox transformation. Since the same transformation needs to be applied to all data, let's concatenate the train and test sets first:
all_data = pd.concat([train,test]).reset_index(drop=True)
all_data.shape
train.index[-1]
num_cols = all_data.select_dtypes(include=['int64','float64']).columns
num_cols = [x for x in num_cols if x not in ord_features]
all_data_num = all_data[num_cols]
all_data_skew = all_data_num.apply(lambda x:x.skew()).sort_values(ascending=False)
all_data_skew
features_skew = list(all_data_skew[np.absolute(all_data_skew) > 0.5].index)
features_skew
from scipy.stats import boxcox

# Box-Cox requires strictly positive inputs, so shift any feature with values <= 0 first:
for feature in features_skew:
    if all_data[feature].min() <= 0:
        all_data[feature] = all_data[feature] - all_data[feature].min() + 1
for feature in features_skew:
    all_data[feature], _ = boxcox(all_data[feature])
all_data_num = all_data[num_cols]
all_data_skew = all_data_num.apply(lambda x:x.skew()).sort_values(ascending=False)
all_data_skew
Looks a lot better. MiscVal and DiffGarageAge are still pretty skewed, likely because so many values were zero to begin with. Visualizing the transformed features:
def distplots(df, cols, ncol_plot=3):
    nrow_plot = int(np.ceil(len(cols)/ncol_plot))
    # squeeze=False keeps ax 2-D even when there is only one row of subplots
    fig, ax = plt.subplots(nrow_plot, ncol_plot, figsize=(16, 16/ncol_plot*nrow_plot), squeeze=False)
    r = 0
    c = 0
    for col in cols:
        sns.distplot(df[col], ax=ax[r, c])
        ax[r, c].set_title(col)
        c += 1
        if c == ncol_plot:
            c = 0
            r += 1
distplots(all_data, num_cols)
all_data = pd.get_dummies(all_data).reset_index(drop=True)
all_data.head()
all_data.shape
Split back into train and test sets:
train = all_data.iloc[:1456]
train.shape
log_sale_price.shape
train.tail()
test = all_data.iloc[1456:]
test.shape
test.head()
pd.set_option('display.max_rows',250)
pd.set_option('display.max_columns',250)
train.head()
We can get rid of the following dummy columns: a row that is zero in every other dummy for the same attribute already implies "None", so these indicators are redundant:
train = train.drop(columns=['MasVnrType_None','GarageType_None'])
test = test.drop(columns=['MasVnrType_None','GarageType_None'])
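This is the classic dummy-variable trap: with K levels, the K-th indicator is fully determined by the other K-1. A toy illustration (made-up values, same idea):

```python
import pandas as pd

s = pd.Series(['BrkFace', 'None', 'Stone', 'None'], name='MasVnrType')
dummies = pd.get_dummies(s, prefix='MasVnrType')
# A row that is zero in every other MasVnrType_* column must be 'None',
# so the 'None' indicator carries no extra information:
reduced = dummies.drop(columns=['MasVnrType_None'])
print(reduced)
```

pd.get_dummies also offers drop_first=True to do this automatically, though that drops the alphabetically first level rather than a chosen one.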
We can also combine the Condition1/2 and Exterior1/2 columns into one feature per attribute. For example if a house has Condition1 = PosA and Condition2 = PosN, just have one column each for Condition_PosA and Condition_PosN, with a 1 in each of those columns.
condition_cols = [col for col in train.columns if col.startswith('Condition')]
exterior_cols = [col for col in train.columns if col.startswith('Exterior')]
conditions = [col.split("_")[1] for col in condition_cols]
conditions = list(dict.fromkeys(conditions))
exteriors = [col.split("_")[1] for col in exterior_cols]
exteriors = list(dict.fromkeys(exteriors))
conditions
exteriors
for cond in conditions:
    cols = [col for col in train.columns if col.startswith('Condition') and col.endswith(cond)]
    train['Condition_'+cond] = train[cols].sum(axis=1).apply(lambda x: 1 if x > 1 else x)
    test['Condition_'+cond] = test[cols].sum(axis=1).apply(lambda x: 1 if x > 1 else x)
for ext in exteriors:
    cols = [col for col in train.columns if col.startswith('Exterior') and col.endswith(ext)]
    train['Exterior_'+ext] = train[cols].sum(axis=1).apply(lambda x: 1 if x > 1 else x)
    test['Exterior_'+ext] = test[cols].sum(axis=1).apply(lambda x: 1 if x > 1 else x)
train.head()
Looks good, let's drop the original condition and exterior columns:
train = train.drop(columns=condition_cols)
train = train.drop(columns=exterior_cols)
test = test.drop(columns=condition_cols)
test = test.drop(columns=exterior_cols)
train.head()
One last thing - combine the WdShing and Wd Shng columns, because these are supposed to be the same:
train['Exterior_WdShing'] = train[['Exterior_WdShing','Exterior_Wd Shng']].sum(axis=1).apply(lambda x:1 if x>1 else x)
test['Exterior_WdShing'] = test[['Exterior_WdShing','Exterior_Wd Shng']].sum(axis=1).apply(lambda x:1 if x>1 else x)
train = train.drop(columns=['Exterior_Wd Shng'])
test = test.drop(columns=['Exterior_Wd Shng'])
train.head()
train.shape
I originally devoted an entire section to feature removal and dimensionality reduction, dropping features based on VIF and then applying PCA. I removed it because the results were less accurate (my submission using VIF and PCA reached the top 30%, while this submission reached the top 19%). I keep those sections in a separate notebook for reference. The lesson learned: removing features before fitting regularized models can throw away too much information from the original dataset.
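For reference, a minimal sketch of a VIF screening like the one I experimented with (this is not part of the final pipeline, and the helper below is my own illustration, computed directly from per-feature R² rather than any particular library's implementation):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing feature i
    on all the other features. Values above ~10 are the usual flag for
    strong multicollinearity. Assumes X is numeric with >= 2 columns."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=[col])
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = np.inf if r2 >= 1 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)
```

A near-duplicate feature pair shows up immediately with a huge VIF, while an independent feature stays near 1.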
A popular way to approach Kaggle competitions is model stacking, where several different models are combined and weighted to arrive at a final result. For practice with the different model types covered so far, I'll tune and stack the following regression models: Lasso, Ridge, Elastic Net, SVR, Random Forest, Gradient Boosting, LightGBM, and XGBoost.
Let's start by using cross-validation on each model to determine the best parameters, observe the performance of each individual model, and stack them all at the end.
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_squared_log_error as msle
Function to calculate RMSE between the prediction and actual values:
def rmse(model, X, y):
    return np.sqrt(mse(y, model.predict(X)))
Function to get the mean CV RMSE, which will be the best indicator of how the model does on the test set (i.e., how it does on data it is not trained on):
def rmse_cv(model, X, y):
    # scoring is negated by sklearn, so flip the sign before the square root;
    # also avoid shadowing the rmse() helper above
    scores = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return np.mean(scores)
Function to perform grid search CV, select and fit the best model, and print the R^2, RMSE, and CV RMSE:
def model_cv_fit(model, param_grid, X, y):
    cv = GridSearchCV(estimator=model, param_grid=param_grid)
    cv.fit(X, y)
    print(cv.best_params_)
    # best_estimator_ is already refit on the full data (refit=True by default)
    model_best = cv.best_estimator_
    print("R^2: " + str(model_best.score(X, y)))
    print("RMSE: " + str(rmse(model_best, X, y)))
    print("Avg CV RMSE: " + str(rmse_cv(model_best, X, y)))
    return model_best
For Lasso, we just need to tune alpha, the regularization strength (sklearn's alpha multiplies the penalty term, so a larger alpha means more penalization).
alpha = [0.0003,0.0004,0.0005]
param_grid = {'lasso__alpha':alpha}
lasso = Pipeline(steps=[('scaler',RobustScaler()), ('lasso',Lasso())])
lasso_best = model_cv_fit(lasso, param_grid, train, log_sale_price)
Ridge is tuned the same way as Lasso. The only difference between the two models is the penalty: Lasso uses the L1 norm (Manhattan distance), while Ridge uses the L2 norm (Euclidean distance).
alpha = [7.9,8,8.1]
param_grid = {'ridge__alpha':alpha}
ridge = Pipeline(steps=[('scaler',RobustScaler()), ('ridge',Ridge())])
ridge_best = model_cv_fit(ridge, param_grid, train, log_sale_price)
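An illustrative aside on synthetic data (not part of the competition pipeline): the practical upshot of L1 vs. L2 is that Lasso performs feature selection by driving uninformative coefficients exactly to zero, while Ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: only the first two of ten features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10).fit(X, y)

print("Lasso zeroed coefficients:", int(np.sum(lasso.coef_ == 0)))  # L1 zeroes the noise features
print("Ridge zeroed coefficients:", int(np.sum(ridge.coef_ == 0)))  # L2 shrinks, but almost never to exactly 0
```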
Elastic Net requires tuning an additional parameter, rho, for the weighting between Lasso and Ridge.
alpha = [0.0004,0.0005,0.0006]
rho = [0.82,0.83,0.84]
param_grid = {'enet__alpha':alpha,'enet__l1_ratio':rho}
enet = Pipeline(steps=[('scaler',RobustScaler()), ('enet',ElasticNet())])
enet_best = model_cv_fit(enet, param_grid, train, log_sale_price)
SVR requires more tuning, and the relevant parameters depend on the kernel. Let's assume the default rbf kernel and tune its parameters: gamma (controls the "shape" of the kernel), plus C and epsilon (both control how "wide" the support-vector margin is).
C = [29] # also tried 28, 30
epsilon = [0.054] # also tried 0.053, 0.055
gamma = [5e-5] # also tried 4e-5, 6e-5
param_grid = {'C':C, 'epsilon':epsilon, 'gamma':gamma}
svr_best = model_cv_fit(SVR(kernel='rbf'), param_grid, train, log_sale_price)
For RF I will tune three parameters: n_estimators (number of individual trees in the random forest), max_depth (maximum depth of a single tree in the ensemble), and min_samples_leaf (minimum number of samples at each leaf node).
n_estimators = [300] # Also tried 200, 400
max_depth = [10] # Also tried 9, 11
min_samples_leaf = [3] # Also tried 1, 2, 4
param_grid = {'n_estimators':n_estimators,'min_samples_leaf':min_samples_leaf,'max_depth':max_depth}
rf_best = model_cv_fit(RandomForestRegressor(), param_grid, train, log_sale_price)
For all three gradient boosting methods, tune learning_rate (how large of a step gradient boosting takes in between iterations), n_estimators (number of individual trees in the ensemble), max_depth (maximum depth of an individual tree), and subsample (fraction of samples used in the fitting of each tree, which is essentially regularization).
learning_rate = [0.05] # Also tried 0.025, 0.075
n_estimators = [300] # Also tried 250, 350
max_depth = [3] # Also tried 2, 4
min_samples_leaf = [4] # Also tried 3, 5
subsample = [0.8] # Also tried 0.7, 0.9
param_grid = {'learning_rate':learning_rate,'n_estimators':n_estimators,'max_depth':max_depth,'min_samples_leaf':min_samples_leaf,'subsample':subsample}
gbm_best = model_cv_fit(GradientBoostingRegressor(loss='huber', random_state=0), param_grid, train, log_sale_price)
learning_rate = [0.05] # Also tried 0.025, 0.075
n_estimators = [800] # Also tried, 700, 900
max_depth = [3] # Also tried 2, 4
subsample = [0.8] # Also tried 0.7, 0.9
param_grid = {'learning_rate':learning_rate,'n_estimators':n_estimators,'max_depth':max_depth,'subsample':subsample}
lgbm_best = model_cv_fit(LGBMRegressor(), param_grid, train, log_sale_price)
learning_rate = [0.05] # Also tried 0.025, 0.075
n_estimators = [500] # Also tried 300, 500
max_depth = [3] # Also tried 2, 4
subsample = [0.8] # Also tried 0.7, 0.9
param_grid = {'learning_rate':learning_rate,'n_estimators':n_estimators,'max_depth':max_depth,'subsample':subsample}
xgb_best = model_cv_fit(XGBRegressor(), param_grid, train, log_sale_price)
I did not use SVR or RF in the stacking, since those two had the weakest performance.
stack = StackingRegressor(estimators=[('lasso',lasso_best), ('ridge',ridge_best), ('enet',enet_best), ('gbm',gbm_best), ('lgbm',lgbm_best), ('xgb',xgb_best)], cv=10)
stack.fit(train, log_sale_price)
print("R^2: " + str(stack.score(train, log_sale_price)))
rmse_cv(stack, train, log_sale_price)
This admittedly took a lot of trial and error with Kaggle submissions. After many attempts, I found that LGBM and Ridge were the strongest standalone models, so I gave them higher weights in the final blend. This submission reached the top 19% (about 1000th out of 5200 submissions).
submission_weights = pd.DataFrame()
submission_weights['ID'] = test.index.values + 5  # test index starts at 1456; shift to recover the original test Ids (starting at 1461)
w_stack = 0.35
w_xgb = 0.05
w_lgbm = 0.25
w_gbm = 0.05
w_rf = 0
w_svr = 0
w_enet = 0.05
w_ridge = 0.2
w_lasso = 0.05
print(w_stack+w_xgb+w_lgbm+w_gbm+w_rf+w_svr+w_enet+w_ridge+w_lasso)
submission_weights['SalePrice'] = np.expm1(w_stack*stack.predict(test) + w_xgb*xgb_best.predict(test) + w_lgbm*lgbm_best.predict(test) + w_gbm*gbm_best.predict(test) + w_rf*rf_best.predict(test) + w_svr*svr_best.predict(test) + w_enet*enet_best.predict(test) + w_ridge*ridge_best.predict(test) + w_lasso*lasso_best.predict(test))  # expm1 inverts the log1p transform of the target
submission_weights.to_csv('submit.csv', index=False)
The model predicts reasonably well, but we still don't know what exactly drives housing prices. To look more into this, let's look at the regression coefficients of the linear models and the feature importance of the gradient boosting models. Since Ridge and LGBM ended up performing the best on the test set, I'll use those to visualize importance.
ridge_coefs = pd.DataFrame()
ridge_coefs['feature'] = train.columns
ridge_coefs['coef'] = ridge_best['ridge'].coef_
ridge_coefs.sort_values(by='coef',ascending=False)
from lightgbm import plot_importance
fig, ax = plt.subplots(1,1,figsize=(16,30))
plot_importance(lgbm_best, ax=ax)
Some observations here:
With more time, I would try to do the following:
With more data, I would try to do the following: